fix(ai): stop discarding exploit mode; wire + document it (D2) #1263

Merged
ocervell merged 1 commit into ai-resiliency from fix/exploit-mode-wiring on Jul 2, 2026

ocervell commented Jul 1, 2026

Contributor

Finding — D2: exploit mode half-wired (P4 Pertinence)

The selection prompt offers exploit, and the full mode exists
(MODES["exploit"], SYSTEM_EXPLOIT, get_system_prompt branch;
modes/_selection.txt classifies attack/chat/exploit), but _detect_mode
discarded an exploit classification and the mode opt help omitted it.

Root cause

  • secator/tasks/ai.py:764 (pre-change) — _detect_mode accepted only
    ("attack", "chat") from the intent LLM; an exploit verdict hit the
    else and reverted to old_mode or "chat". Since D4's fast_detect_mode
    already defers exploit-ish prompts to the LLM, the LLM could return
    exploit — it was just discarded here.
  • secator/tasks/ai.py:74: mode opt help hardcoded "Mode: attack or chat".
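
The pre-change fallback described above can be sketched roughly as follows. This is a simplified stand-in, not the actual code at secator/tasks/ai.py:764: the function name detect_mode_pre_change, its signature, and the verdict parameter (standing in for the intent LLM's classification) are assumptions made for illustration.

```python
from typing import Optional


# Hypothetical stand-in for the pre-change logic; the real _detect_mode is a
# method on the AI task and obtains the verdict from the intent LLM itself.
def detect_mode_pre_change(verdict: str, old_mode: Optional[str] = None) -> str:
    """Return the mode chosen from the intent LLM's verdict (pre-change)."""
    if verdict in ("attack", "chat"):  # hardcoded tuple: "exploit" is missing
        return verdict
    # Any other verdict, including a valid "exploit", is discarded here.
    return old_mode or "chat"


print(detect_mode_pre_change("exploit"))            # chat
print(detect_mode_pre_change("exploit", "attack"))  # attack
```

Under these assumptions, an exploit verdict can never become the active mode, which matches the "unreachable by detection" symptom.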

Changes

  • _detect_mode: accept any mode in MODES (adds exploit; attack/chat
    behavior identical). secator/tasks/ai.py:764
  • mode opt help derived from MODES.keys() — single source of truth, no
    drift (f"Mode: {', '.join(MODES)}"). secator/tasks/ai.py:74
  • Imported MODES into the task module (DRY; no third hardcoded mode tuple).
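
A minimal sketch of the two changes, assuming MODES is a dict keyed by mode name. The registry values, the key order, the MODE_HELP constant name, and the standalone detect_mode function are illustrative assumptions; only the three mode names and the help f-string come from this PR.

```python
from typing import Optional

# Hypothetical registry: the real MODES is imported from secator's AI package
# and its values are not shown in this PR. Only the key names are from the text.
MODES = {"attack": {}, "chat": {}, "exploit": {}}

# Help text derived from the registry, so adding a mode updates it automatically.
MODE_HELP = f"Mode: {', '.join(MODES)}"


def detect_mode(verdict: str, old_mode: Optional[str] = None) -> str:
    """Accept any verdict that names a registered mode; otherwise fall back."""
    if verdict in MODES:
        return verdict
    return old_mode or "chat"


print(MODE_HELP)               # Mode: attack, chat, exploit
print(detect_mode("exploit"))  # exploit
print(detect_mode("bogus"))    # chat
```

Membership in MODES replaces the hardcoded tuple, so detection and the help string share one source of truth.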

No change to the exploit SAFETY posture — exploit still runs through the same
PermissionEngine/guardrails; only detection + docs changed.

Exploit prompt is real

secator/ai/prompts/modes/exploit.txt is a full 43-line template (persona =
"exploitation verification specialist", methodology, add_finding
exploitation-report flow, ${guardrails}/${isolation} constraints).
get_system_prompt("exploit", ...) renders with no leftover ${include} or
$template_var placeholders (D1's $query_types/$output_types_reference
substitution covers exploit too).
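
The "no leftover placeholders" check can be illustrated with a small sketch. It assumes the templates use $name / ${name} placeholders in the style of Python's string.Template, which the PR text suggests but does not confirm. The miniature template, the substituted values, and the unresolved_placeholders helper are hypothetical and are not the contents of exploit.txt or the project's test code.

```python
import re
from string import Template
from typing import List

# Matches $name and ${name}; good enough for a rendered-prompt sanity check.
PLACEHOLDER_RE = re.compile(r"\$\{?([A-Za-z_]\w*)\}?")


def unresolved_placeholders(rendered: str) -> List[str]:
    """Return the names of placeholders still present in a rendered prompt."""
    return [m.group(1) for m in PLACEHOLDER_RE.finditer(rendered)]


# Hypothetical miniature template, not the real 43-line exploit.txt.
EXPLOIT_TEMPLATE = Template(
    "You are an exploitation verification specialist.\n"
    "Constraints: ${guardrails} ${isolation}\n"
    "Query types: $query_types\n"
)

# safe_substitute leaves unknown placeholders in place instead of raising.
partial = EXPLOIT_TEMPLATE.safe_substitute(guardrails="stay in scope")
full = EXPLOIT_TEMPLATE.safe_substitute(
    guardrails="stay in scope",
    isolation="sandboxed",
    query_types="finding, target",
)

print(unresolved_placeholders(partial))  # ['isolation', 'query_types']
print(unresolved_placeholders(full))     # []
```

An empty list for the fully rendered prompt corresponds to the "renders clean" condition described above.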

Tests

Added focused tests (tests/unit/test_ai_task_opts.py,
tests/unit/test_ai_prompts.py):

  • LLM _detect_mode verdict of exploit now sets self.mode == "exploit"
    (previously fell back to chat).
  • attack / chat / unknown verdicts unchanged (unknown → chat fallback).
  • get_system_prompt("exploit") renders clean (no unresolved placeholders).
  • mode opt help lists every mode incl. exploit.
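
The shape of these tests might look like the sketch below. FakeAITask, its classify callable, and the MODES literal are hypothetical stand-ins; the real tests in tests/unit/test_ai_task_opts.py exercise the actual task class and are not reproduced here.

```python
import unittest
from unittest.mock import MagicMock

MODES = {"attack": {}, "chat": {}, "exploit": {}}  # hypothetical registry


class FakeAITask:
    """Hypothetical stand-in for the AI task; only mode detection is modeled."""

    def __init__(self, classify, mode=None):
        self.classify = classify  # callable standing in for the intent LLM
        self.mode = mode

    def _detect_mode(self, prompt):
        verdict = self.classify(prompt)
        self.mode = verdict if verdict in MODES else (self.mode or "chat")
        return self.mode


class DetectModeTest(unittest.TestCase):
    def _run(self, verdict, old_mode=None):
        task = FakeAITask(MagicMock(return_value=verdict), mode=old_mode)
        task._detect_mode("some prompt")
        return task.mode

    def test_exploit_verdict_is_kept(self):
        self.assertEqual(self._run("exploit"), "exploit")

    def test_attack_and_chat_unchanged(self):
        self.assertEqual(self._run("attack"), "attack")
        self.assertEqual(self._run("chat", old_mode="attack"), "chat")

    def test_unknown_verdict_falls_back_to_chat(self):
        self.assertEqual(self._run("bogus"), "chat")

    def test_help_lists_every_mode(self):
        help_text = f"Mode: {', '.join(MODES)}"
        for mode in MODES:
            self.assertIn(mode, help_text)
```

Saved as a test module, a file like this could be run with python -m unittest.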

Baseline vs after (test_ai_loop.py test_ai_session.py test_ai_prompts.py):
baseline 12 failed / 85 passed → after 12 failed / 86 passed. The 12
failures are identical pre-existing env failures (shfmt/safecmd sandbox), not
regressions; the +1 pass is the new exploit-render test.

Related smells (not fixed here)

  • secator/ai/prompts.py:248: `if mode in ("attack", "exploit"):` hardcodes
    the "uses library reference" set; a new library-using mode would need a
    manual edit. Candidate to derive from mode config.
  • secator/tasks/ai.py:66 — class docstring still says "(attack or chat
    mode)"; omits exploit.
  • secator/ai/prompts/modes/_selection.txt hardcodes the three mode names in
    prose — can drift from MODES if a mode is added/removed.
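
One possible direction for the first smell is to derive the library-reference set from per-mode config. The uses_library_reference key and the helper below are assumptions for illustration; the real MODES schema is not shown in this PR and may differ.

```python
# Hypothetical mode config: the "uses_library_reference" key is an assumption,
# not a field confirmed to exist in secator's MODES.
MODES = {
    "attack": {"uses_library_reference": True},
    "chat": {"uses_library_reference": False},
    "exploit": {"uses_library_reference": True},
}


def uses_library_reference(mode: str) -> bool:
    """Look the flag up in mode config instead of a hardcoded tuple."""
    return bool(MODES.get(mode, {}).get("uses_library_reference", False))


library_modes = [name for name in MODES if uses_library_reference(name)]
print(library_modes)  # ['attack', 'exploit']
```

With a flag like this, a new library-using mode would only need its config entry, not an edit to the condition in prompts.py.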

🤖 Generated with Claude Code

`_detect_mode` accepted only ("attack","chat") from the intent LLM and
discarded an "exploit" classification (fell back to old_mode/chat), even
though the full exploit mode exists (MODES entry, SYSTEM_EXPLOIT prompt,
get_system_prompt branch, _selection.txt classifies it). So exploit mode
was unreachable by detection and the `mode` opt help omitted it.

- _detect_mode: accept any mode in MODES (incl. exploit); attack/chat
  unchanged.
- `mode` opt help derived from MODES.keys() (DRY, no drift).
- Tests: LLM "exploit" verdict now sets mode=exploit; attack/chat/unknown
  unchanged; exploit system prompt renders with no leftover template vars.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai Bot commented Jul 1, 2026

Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4a481a44-c097-4ea5-84b6-ede5f6f34b62

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


ocervell merged commit 2a9f25a into ai-resiliency on Jul 2, 2026
1 check passed
ocervell deleted the fix/exploit-mode-wiring branch on July 2, 2026 15:34